Abstract

This paper provides a formal analysis of a powerful mapping technique known as scatter decomposition.
Scatter decomposition divides an irregular computational domain into a large number of equal sized
pieces, and distributes them modularly among processors. We use a probabilistic model of workload in
one dimension to formally explain why, and when, scatter decomposition works. Our first result is that if
correlation in workload is a convex function of distance, then scattering a more finely decomposed domain
yields a lower average processor workload variance. Our second result shows that if the workload process
is stationary Gaussian and the correlation function decreases linearly in distance until becoming zero
and then remains zero, scattering a more finely decomposed domain yields a lower expected maximum
processor workload. Finally we show that if the correlation function decreases linearly across the entire
domain, then among all mappings that assign an equal number of domain pieces to each processor, scatter
decomposition minimizes the average processor workload variance. The dependence of these results
on the assumption of decreasing correlation is illustrated with situations where a coarser granularity
actually achieves better load balance.
1 Introduction


Scatter decomposition [1] (also described as modular mapping [4]) is an effective method for parallelizing
a large class of irregular scientific programs that are tied to physical domains. Examples include a wide
variety of techniques for numerically solving time dependent partial differential equations, and other, less
numerical domain-oriented simulations. Scatter decomposition divides the domain into a set of rectangular
regions with the same spatial size and geometry. The regions are labeled using Cartesian coordinates, and
are mapped to processors by applying the mod function to the label in each coordinate. For example, Figure
1 shows how a two dimensional irregular grid for a PDE is decomposed into strips (marked by the heavy
lines) and assigned to processors. The execution of all workload related to a subregion is a basic unit of
schedulable work which we call a cluster. A cluster's granularity is controlled by the parameters defining the
region size, in this case the strip width.

Scatter decomposition's success lies in its ability to balance workload without ever actually analyzing it.
Any region of high workload tends to be subdivided and distributed among processors. Scatter decomposition
is a technique applied to many problems in many contexts [1, 2, 4, 5, 9, 11, 14, 17]. Its success has been
explained informally in [1] and [4], by appealing to the physics and numerics of many scientific computations.
While these explanations suffice for most practitioners, the literature lacks a full formal analysis of why scatter
decomposition balances workload. This paper provides some such analysis, identifying model assumptions
under which scatter decomposition can be expected to effectively balance load. As such, our work is a
necessary prerequisite for any future formal treatment of the very important problem of managing the
inherent tensions between load imbalance and communication costs in a scatter decomposition.

The object of this paper is to construct and analyze a performance model to explain when and why scatter
decomposition works. The model is based on a number of simplifying assumptions to promote tractability.
As such, it should not be viewed as a model that accurately predicts performance quantitatively. Rather,
it should be viewed as a model that explains performance qualitatively. Specifically, we model workload in
a one dimensional domain as a continuous second-order stationary process. This means that we associate
a random workload with every point in the domain, assume that the mean workload at every point is the
same, assume that the workload variance at every point is the same, and assume that the covariance between
the workloads at any two points is uniquely determined by their distance. The model takes the domain to
be divided into some $n = 2^d$ clusters of equal size, mapped modularly onto $P = 2^p$ processors. Throughout
this paper we take $P$ to be fixed, and $d \ge p$. The degree of the decomposition is defined to be $d$. Given one
scatter decomposition, another of higher degree can be constructed by splitting each cluster into two, then
by modularly mapping the resulting set of clusters.
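As an illustration of the mapping just described, the following Python sketch assigns the $n = 2^d$ clusters of a one dimensional domain to $P = 2^p$ processors. The function names `modular_mapping` and `cluster_interval` are hypothetical and chosen only for this example; the unit-length domain is likewise an assumption made for concreteness.

```python
def modular_mapping(d, p):
    """Owner of each of the n = 2**d clusters on P = 2**p processors (cluster i -> i mod P)."""
    n, P = 2 ** d, 2 ** p
    return [i % P for i in range(n)]


def cluster_interval(i, d):
    """Cluster i of a degree-d decomposition covers [i/n, (i+1)/n) of a unit-length domain."""
    n = 2 ** d
    return (i / n, (i + 1) / n)


# Degree 2 versus degree 3 on P = 2 processors: raising the degree splits every
# cluster in two and then re-applies the modular rule to the finer clusters.
coarse = modular_mapping(2, 1)
fine = modular_mapping(3, 1)
print(coarse)
print(fine)
```

Every processor receives $n/P$ clusters at either degree; only the interleaving becomes finer.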

We  derive  three  main  results,  each  of  which  has  a  different  set  of  assumptions  concerning  the  correlation 
function. 
1. Assumption: The correlation function is convex. Result: Increasing the degree of a scatter decomposition
does not increase the common processor workload variance.

2. Assumptions: The workload process is stationary and Gaussian. The correlation function decreases
linearly until reaching zero, then remains zero (an elbow function). Result: There exists a degree
$d_0$ such that if $d_0 \le d_1 \le d_2$, then the expected maximum processor workload under a scatter
decomposition of degree $d_2$ is no larger than the expected maximum processor workload under a
scatter decomposition of degree $d_1$.

3. Assumption: The correlation function decreases linearly across the entire domain. Result: For any
number of clusters $2^d$, among all mappings that assign $2^{d-p}$ clusters per processor the modular mapping
minimizes the average processor workload variance.

Performance ultimately is measured in terms of finishing time, so that the expected load of the most
heavily loaded processor is an appropriate metric. One of our results addresses this metric directly. Average
processor workload variance is a secondary measure, although intuition does suggest that decreasing the
variance while keeping the mean constant will decrease the expected maximum. Consequently, all these
results confirm our intuition that modularly mapping increasingly finer grained workload leads to better
load balance. It should be noted that increased communication overhead is the price paid for this balance,
and is a cost we do not include in this model. One should not interpret these results as saying that better
overall performance can always be achieved by increasing the degree. For a given domain, there will be an
optimal degree that balances the conflicting goals of low communication costs and good load balance.

A brief analysis of scatter decomposition can be found in [15]. However, that analysis assumes statistical
independence between all cluster workloads, and seems to consider the effects of scatter decomposition on
a given architecture as the problem size is increased. As such it is an inappropriate model for studying the
effects of changing the mapping of a single given problem. Treatments of other problems have used stochastic
models of workload to estimate the expected finishing time, but invariably those models concern statistically
independent workloads, e.g. the analyses in [3] and [6]. These results are inadequate for analyzing scatter
decomposition. When all workload is independent, then aggregated workload is independent, and there is
no performance benefit to be gained by scattering. Scatter decomposition is successful precisely because the
workload is not independent. Our contribution is to propose and analyze a model that includes workload
correlation, and explain why increasingly finer partitions mapped modularly tend to balance the load better.
2 Analysis


In this section we study a probabilistic model of workload, and the performance of different mappings. For
the sake of simplicity we constrain our model to be one-dimensional. This assumption does not negate
the utility of the model; any multi-dimensional problem partitioned into hyper-strips can be viewed as a
one-dimensional problem. Such partitions greatly simplify the programming needed to exchange information
between processors. In fact, our experience in mapping a land-battle simulation using scatter decomposition
was that strip partitions minimized the execution time [10]. This was also our experience in mapping a
regular scientific code onto the Intel iPSC/1 [16].

Our analysis concerns the effect of scatter decomposition on load balance in the absence of communication
or synchronization costs. By understanding how load balance in isolation is affected by the
decomposition/mapping decisions, we are better able to understand the tension between load imbalance and
communication/synchronization overheads. The model we use is intended to be descriptive, rather than
predictive; the analysis is qualitative rather than quantitative. We doubt that the end benefits of fitting a
model to performance data will justify the costs of doing so. Nevertheless we feel there is worth in formally
affirming the intuition behind scatter decomposition.

2.1  When  and  Why  Scatter  Decomposition  Works 

Our model explains the success of scatter decomposition by showing that it induces correlation between
processors' workloads. To see the performance benefits of correlated workloads, imagine that a random
workload is generated and partitioned so that the same amount of work is assigned to every processor. A
processor's workload is random, but all processors always finish at the same time, because their workloads
are perfectly correlated. This situation is optimal, because all processors are busy all the time. Now
imagine that the workload at every point is statistically independent of any other. No matter what the
domain decomposition or mapping, processor workloads are statistically independent. In fact, the expected
maximum processor workload is the same regardless of granularity, so long as the same volume of domain
is assigned to each processor. The "ideal" of random but highly correlated processor workloads cannot be
achieved in this artificial scenario.

Scatter decomposition works because irregular workloads are not statistically independent: high workload
tends to appear in contiguous regions. A sufficiently fine-grained decomposition will split the region up, and
modular assignment will spread its workload around. The contribution of that region to one processor's
workload is highly correlated with the contribution of a nearby region to a different processor's workload. If
the underlying workload is highly correlated in nearby regions, then scatter decomposition induces correlation
between processors' workloads. We have observed this phenomenon in our own experiments with a
one-dimensional fluid flow computation using adaptive gridding [11]. The fluids problem exhibits irregular grids
similar to those in Figure 1.

For a given problem, the sample autocorrelation function [12] (p. *13?) is a statistical estimate of correlation
between point workloads, as a function of the distance between them. Autocorrelations range between 1
and -1; the larger the autocorrelation, the more similar the workloads of two points at a given distance tend
to be. Zero correlation indicates no linear relationship (and, for jointly Gaussian workloads, statistical
independence); increasingly negative correlations imply increasing
dissimilarity between workloads. Figure 2 shows the sample autocorrelation function at one time-step in a
fluid flow computation. Not only does correlation diminish as a function of distance, it can reasonably be
modeled as a convex "elbow" function $\mathrm{Cov}(t) = \sigma^2 \max\{0, 1 - \alpha t\}$ over an appropriate range of $t$, and some
$\alpha > 0$. This corresponds nicely with two of our results, one of which assumes elbow correlation, the other of
which assumes a convex correlation function.
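For readers who wish to inspect their own workload traces, the sketch below computes a sample autocorrelation and evaluates the elbow model. It uses one common estimator convention (deviations from the overall mean, normalized by the lag-zero sum); this convention, the function names, and the default parameter values are assumptions made for illustration and are not taken from [12].

```python
def sample_autocorrelation(w, lag):
    """Sample autocorrelation of the sequence w at a non-negative integer lag."""
    n = len(w)
    mean = sum(w) / n
    denom = sum((x - mean) ** 2 for x in w)
    num = sum((w[i] - mean) * (w[i + lag] - mean) for i in range(n - lag))
    return num / denom


def elbow(t, sigma2=1.0, alpha=4.0):
    """Elbow model sigma^2 * max(0, 1 - alpha * t): linear decay that reaches zero at t = 1/alpha."""
    return sigma2 * max(0.0, 1.0 - alpha * t)


# A strictly alternating workload is negatively correlated at lag 1.
alternating = [1, -1] * 4
print(sample_autocorrelation(alternating, 1))
print(elbow(0.125))
```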

There are situations where scatter decomposition will not work well. Consider a one dimensional domain
discretized into 1000 points, numbered between 0 and 999, to be mapped onto ten processors. Randomly
choose some "base" number $b \in [0, 99]$, and imagine that every hundredth point beginning with $b$ has a
computational cost of 1000, while all other points have a computational cost of 1. If one evenly divides the
domain into ten subregions and maps them modularly, every processor has 1099 units of computation to
execute. Scatter a decomposition of twenty subregions, and half the processors each have a computational
cost of 2098, while the other half each have a cost of 100. Modularly assign each point individually, and
processor $(b \bmod 10)$ has a cost of 10090, while every other processor has a cost of 100. In this situation
mapping increasingly finer-grained workload leads to decreasing performance. Due to $b$'s randomness this
workload model is stochastic, and is second-order stationary. Two points at a distance $100m$ for $m = 1, \ldots, 9$
will always have the same workload. The correlation function at all distances $100m$ consequently has value
one. It has some fixed smaller value for all other distances. The principal reason this problem defeats
fine-grained scatter decomposition is the periodicity. One should be extremely careful using scatter decomposition
in the presence of strong periodic behavior, if there is any chance that the periodicity of the modular mapping
can align with the periodicity of workload. The assumptions of the models we study do not admit periodicity.
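The arithmetic of this periodic example can be reproduced directly. The sketch below builds the 1000-point cost vector for a chosen base $b$ and sums the cost each processor receives under a modular mapping of equal-sized subregions. The function name `processor_loads` and its keyword parameters are hypothetical, introduced only for this illustration.

```python
def processor_loads(b, subregions, points=1000, procs=10, period=100, heavy=1000, light=1):
    """Per-processor cost when `subregions` equal pieces are mapped modularly onto `procs` processors."""
    # Every `period`-th point starting at b is expensive; all others are cheap.
    cost = [heavy if (i - b) % period == 0 else light for i in range(points)]
    size = points // subregions
    loads = [0] * procs
    for r in range(subregions):
        loads[r % procs] += sum(cost[r * size:(r + 1) * size])
    return loads


b = 37  # any base in [0, 99] gives the same pattern of loads
for pieces in (10, 20, 1000):
    print(pieces, processor_loads(b, pieces))
```

The maximum load grows from 1099 to 2098 to 10090 as the granularity becomes finer, which is the opposite of the behavior obtained under decreasing correlation.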

2.2  Model  Preliminaries 

We consider the behavior of a computation over a real line interval, divided into $n$ clusters, and mapped onto
$P$ processors. Both $n$ and $P$ are taken to be powers of two, and $n \ge P$. We are interested in the average
processor workload variance, and in the expected workload of the processor that takes the longest time to
complete. Without loss of generality we take the real interval to be $[0, 1]$. Assume that every point $t \in [0, 1]$
has a certain work intensity $W(t)$. The time required to process $[a, b]$ is the integral of $W(t)$ from $t = a$ to
$t = b$. We assume that the intensities $W(t)$ are unknown, but we are willing to model our uncertainty by
assuming that $W(t)$ is a random variable, and that $W(t)$ can be viewed as a second-order stationary process
[13] over $t \in [0, 1]$. Thus we suppose that $E[W(t)] = \mu$ for all $t \in [0, 1]$, that $\mathrm{Var}[W(t)] = \sigma^2$ for all $t \in [0, 1]$,
and that $\mathrm{Cov}[W(t), W(s)]$ depends only on $|t - s|$. To emphasize this point we will denote the covariance
function as $\mathrm{Cov}(|t - s|)$. These assumptions are reasonable if we are unwilling or unable to differentiate
between the likely behavior of the computation at $t$ and at $s$. We do not assume that $W(t) = W(s)$; we
simply assume that we have the same degree of uncertainty about $W(t)$ and $W(s)$.

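The execution time for $[a, b]$ is

$$T(a, b) = \int_a^b W(t)\,dt,$$

and has mean value $(b - a)\mu$. The variance of $T(a, b)$ is

$$
\begin{aligned}
\mathrm{Var}[T(a, b)] &= E[T(a, b)^2] - (b - a)^2\mu^2 \\
&= E\left[\left(\int_a^b W(t)\,dt\right)\left(\int_a^b W(s)\,ds\right)\right] - (b - a)^2\mu^2 \\
&= \int_a^b \int_a^b E[W(t)W(s)]\,dt\,ds - (b - a)^2\mu^2 \\
&= \int_a^b \int_a^b \mathrm{Cov}(|s - t|)\,dt\,ds, \qquad (1)
\end{aligned}
$$

where the last step uses $E[W(t)W(s)] = \mathrm{Cov}(|s - t|) + \mu^2$.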


Following a decomposition into $n$ clusters, the $i$th cluster's workload is $T(i/n, (i+1)/n)$ and is denoted
as $c_i(n)$. The random vector of cluster workloads is denoted $C(n) = \langle c_0(n), \ldots, c_{n-1}(n) \rangle$.

We are interested in the covariance matrix $\sigma_C^2$ for the cluster workloads. For $i \ne j$ we have

$$\mathrm{Cov}[c_i(n), c_j(n)] = (\sigma_C^2)_{ij} = \int_{i/n}^{(i+1)/n} \int_{j/n}^{(j+1)/n} \mathrm{Cov}(|t - s|)\,dt\,ds. \qquad (2)$$

$\mathrm{Var}[c_j(n)]$ is simply $\mathrm{Var}[T(j/n, (j+1)/n)]$, given above. The sequence $c_0(n), c_1(n), \ldots, c_{n-1}(n)$ is
second-order stationary, a fact easily deduced from equations (1) and (2). To emphasize this we define the function

$$\phi(|j - i|, n) = \mathrm{Cov}[c_i(n), c_j(n)].$$

Note that $\phi(0, n)$ is a cluster's variance.

An assignment of clusters to processors is described by a $P \times n$ assignment matrix $A$ whose $ij$-th entry
is 1 if $c_j(n)$ is assigned to processor $i$, and is 0 otherwise. Given assignment matrix $A$, the multiplication
$AC$ yields a $P \times 1$ random vector whose $j$th component is the sum of the execution times of all clusters
assigned to processor $j$. The vector of mean processor loads is the matrix-vector product $A\mu_n$, where $\mu_n$ is
the $n$ element vector with $\mu/n$ in each coordinate. The covariance matrix of $AC$ is the product $A\sigma_C^2 A^T$,
where $A^T$ is the transpose of $A$. The overall execution time is the maximum processor execution time, or
$\max_i\{(AC)_i\}$. This quantity is random.




For any processor $P_i$, let $A(i)$ denote the set of clusters assigned to it under $A$, and let $L_i(A, n)$ be $P_i$'s
random workload. By definition the variance of $L_i(A, n)$ is given by

$$\mathrm{Var}[L_i(A, n)] = (A\sigma_C^2 A^T)_{ii} = \sum_{j \in A(i)} \phi(0, n) + \sum_{\substack{j, k \in A(i) \\ j \ne k}} \phi(|j - k|, n). \qquad (3)$$

The first component of this expression is the sum of variances of all clusters assigned to $P_i$. The second
component is a sum of cluster covariance terms (we will call these cc terms), that depends on the assignment.
Similarly, the covariance between the processor workloads $L_i(A, n)$ and $L_j(A, n)$ is given by a sum of cc terms:

$$\mathrm{Cov}[L_i(A, n), L_j(A, n)] = \sum_{k \in A(i)} \sum_{m \in A(j)} \phi(|k - m|, n). \qquad (4)$$

The sum of all cluster covariance matrix terms always equals the sum of all processor workload variances
and covariances:¹

$$\sum_{i=0}^{n-1} \sum_{j=0}^{n-1} (\sigma_C^2)_{ij} = \sum_{i=0}^{P-1} \sum_{j=0}^{P-1} (A\sigma_C^2 A^T)_{ij}.$$

This implies a balance between processor workload variances and covariances (and hence correlations); if by
changing $A$ we reduce the average processor workload variance, then we are increasing the average
inter-processor workload correlation.
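The conservation law is a purely algebraic identity and can be checked numerically for any assignment. The sketch below is an illustration only: the stationary covariance $1/(1 + |i - j|)$ is a hypothetical stand-in for $\sigma_C^2$, and the helper names are not from the paper. It also compares the summed processor variances of the modular mapping with those of a contiguous block mapping.

```python
from fractions import Fraction
import random


def cluster_cov(n):
    """Hypothetical stationary cluster covariance matrix: entry depends only on |i - j|."""
    return [[Fraction(1, 1 + abs(i - j)) for j in range(n)] for i in range(n)]


def processor_cov(sigma, owner, P):
    """Entries of A * sigma * A^T, where owner[j] is the processor holding cluster j."""
    out = [[Fraction(0)] * P for _ in range(P)]
    for j, row in enumerate(sigma):
        for k, value in enumerate(row):
            out[owner[j]][owner[k]] += value
    return out


def total(matrix):
    return sum(sum(row) for row in matrix)


def trace(matrix):
    return sum(matrix[i][i] for i in range(len(matrix)))


n, P = 16, 4
sigma = cluster_cov(n)
modular = [j % P for j in range(n)]           # scatter decomposition
blocks = [j // (n // P) for j in range(n)]    # contiguous blocks
shuffled = modular[:]
random.Random(0).shuffle(shuffled)            # some other equal-count assignment

print(total(sigma), total(processor_cov(sigma, shuffled, P)))
print(trace(processor_cov(sigma, modular, P)), trace(processor_cov(sigma, blocks, P)))
```

Because the grand total is fixed, the smaller summed variance of the modular mapping must be matched by larger inter-processor covariances.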

The indices of the sums (3) and (4) have special structure when $A$ describes a modular mapping. We
know that if $c_j(n)$ and $c_k(n)$ are assigned to the same processor, then $|j - k|$ is a multiple of $P$. Under a
modular mapping each processor will have $n/P$ clusters. Among these there are $n/P - 1$ pairs of clusters
whose indices are exactly $P$ apart, $n/P - 2$ pairs whose indices are exactly $2P$ apart, and so on. Since $n$ and
$P$ determine the specifics of the mapping we may drop the notational dependence of $L_i(A, n)$ on $A$. Under
the modular mapping we may write the common processor workload variance as

$$\mathrm{Var}[L(n)] = (n/P)\,\phi(0, n) + 2 \sum_{k=1}^{(n/P)-1} ((n/P) - k)\,\phi(kP, n). \qquad (5)$$
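Equation (5) can be checked against a brute-force sum over all pairs of clusters owned by one processor. In the sketch below the function `phi` is a hypothetical stationary cluster covariance used only as test data; any function of the index distance would serve.

```python
from fractions import Fraction


def phi(m, n):
    """Hypothetical stationary cluster covariance at index distance m (n is unused here)."""
    return Fraction(1, 1 + m)


def variance_brute_force(i, n, P):
    """Sum phi over every ordered pair of clusters assigned to processor i."""
    mine = [j for j in range(n) if j % P == i]
    return sum(phi(abs(j - k), n) for j in mine for k in mine)


def variance_eq5(n, P):
    """Right-hand side of equation (5)."""
    q = n // P
    return q * phi(0, n) + 2 * sum((q - k) * phi(k * P, n) for k in range(1, q))


print(variance_eq5(16, 4), variance_brute_force(0, 16, 4))
```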

To consider processor workload covariance under a modular assignment take $i < j$, and consider a
cluster $c_a(n)$ assigned to processor $P_i$. It has cc terms with all processor $P_j$ clusters $c_m(n)$ such that
$|a - m| \bmod P = j - i$ or $|a - m| \bmod P = P - j + i$. There are $((n/P) - k)$ cc terms arising from clusters
whose indices are $kP + j - i$ apart (for $k = 0, \ldots, (n/P) - 1$); there are $((n/P) - k)$ cc terms arising from
clusters whose indices are $kP - j + i$ apart (for $k = 1, \ldots, (n/P) - 1$). We may therefore write

$$
\begin{aligned}
\mathrm{Cov}[L_i(n), L_j(n)] &= \sum_{k=0}^{(n/P)-1} ((n/P) - k)\,\phi(kP + j - i, n) + \sum_{k=1}^{(n/P)-1} ((n/P) - k)\,\phi(kP - j + i, n) \\
&= (n/P)\,\phi(j - i, n) + \sum_{k=1}^{(n/P)-1} ((n/P) - k)\,\phi(kP + j - i, n) + \sum_{k=1}^{(n/P)-1} ((n/P) - k)\,\phi(kP - j + i, n). \qquad (6)
\end{aligned}
$$

¹ This conservation law proved to be invaluable when debugging detailed expressions for the processor workload variance and
covariances, e.g. (12) and (13).
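The counting argument behind equation (6) can be verified the same way as equation (5): sum the cluster covariances over every pair of clusters held by two different processors and compare with the closed form. As before, `phi` is a hypothetical covariance used purely as test data.

```python
from fractions import Fraction


def phi(m, n):
    """Hypothetical stationary cluster covariance at index distance m (n is unused here)."""
    return Fraction(1, 1 + m)


def covariance_brute_force(i, j, n, P):
    """Sum phi over all pairs (cluster of processor i, cluster of processor j)."""
    return sum(phi(abs(a - m), n) for a in range(i, n, P) for m in range(j, n, P))


def covariance_eq6(i, j, n, P):
    """Right-hand side of equation (6) for i < j."""
    q = n // P
    ahead = sum((q - k) * phi(k * P + j - i, n) for k in range(0, q))
    behind = sum((q - k) * phi(k * P - j + i, n) for k in range(1, q))
    return ahead + behind


print(covariance_eq6(0, 3, 16, 4), covariance_brute_force(0, 3, 16, 4))
```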


2.3  Decreasing  Workload  Variance 


Under very general assumptions one can show that increasing the degree of a scatter decomposition reduces
the common processor workload variance. The necessary assumptions are that the workload process be
second-order stationary, and that its covariance function be convex.

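The first step is to show that $\phi(|j - i|, n)$ is a convex function of $|j - i|$ over the range $1, 2, \ldots, n - 1$.
Towards this end assume that $x \ge 1/n$ and define

$$
\begin{aligned}
f(n, x) &= E\left[\int_0^{1/n} \int_x^{x+1/n} (W(s) - \mu)(W(t) - \mu)\,dt\,ds\right] \\
&= \int_0^{1/n} \int_x^{x+1/n} \mathrm{Cov}(t - s)\,dt\,ds \\
&= \int_0^{1/n} \left( \int_x^{\infty} \mathrm{Cov}(t - s)\,dt - \int_{x+1/n}^{\infty} \mathrm{Cov}(t - s)\,dt \right) ds.
\end{aligned}
$$

Taking the derivative with respect to $x$ we find that

$$\frac{\partial}{\partial x} f(n, x) = \int_0^{1/n} \bigl( \mathrm{Cov}(x + 1/n - s) - \mathrm{Cov}(x - s) \bigr)\,ds.$$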


The difference being integrated increases in $x$ due to the convexity of $\mathrm{Cov}(t)$, implying that the derivative of $f(n, x)$
with respect to $x$ increases in $x$, one characterization of a convex function. By stationarity $\mathrm{Cov}[c_i(n), c_j(n)] = \mathrm{Cov}[c_0(n), c_{|j-i|}(n)]$;
furthermore $\mathrm{Cov}[c_0(n), c_{|j-i|}(n)] = f(n, |j - i|/n)$. Consequently $\mathrm{Cov}[c_i(n), c_j(n)]$ is a
convex function of $|j - i|$ once $|j - i| \ge 1$ (it may indeed be convex over the entire range, but that fact has
not been shown, and is not needed).

We are interested in the effects of moving from a scatter decomposition with degree $d - 1$ to one with
degree $d$. To analyze these effects we make the following observation. Consider a domain partitioned into
$n = 2^d$ clusters, which is mapped by modularly assigning pairs of clusters: $c_0(n)$ and $c_1(n)$ are assigned to
processor 0, $c_2(n)$ and $c_3(n)$ are assigned to processor 1, and so on. This mapping is identical to the scatter
decomposition of degree $d - 1$; the pair of clusters $c_0(n), c_1(n)$ viewed from the degree $d$ mapping is the same
as the single cluster $c_0(n/2)$ viewed from the degree $d - 1$ mapping. We will show that the modular mapping
with degree $d - 1$ produces processor variances that are no smaller than those of the modular mapping with
degree $d$.

Split each cluster $c_i(n/2)$ into two equal sized clusters. The sum of the two split cluster variances plus
twice their covariance must equal the variance of $c_i(n/2)$. That is,

$$\phi(0, n/2) = 2\phi(0, n) + 2\phi(1, n). \qquad (7)$$

Similarly, take two clusters $c_i(n/2)$ and $c_j(n/2)$, and split each into two equal sized clusters. The total
covariance between the four split clusters must equal the covariance between the two unsplit clusters. Thus

$$\phi(|j - i|, n/2) = 2\phi(2|j - i|, n) + \phi(2|j - i| + 1, n) + \phi(2|j - i| - 1, n). \qquad (8)$$

Note that the index values must double when taken with respect to $n$ rather than $n/2$ clusters.
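The splitting identities (7) and (8) can be checked numerically. The sketch below approximates the double integral in (2) by a midpoint rule on a fixed grid shared by both granularities, so the identities hold up to floating-point rounding. The covariance $e^{-3|t|}$, the grid size, and the function names are assumptions chosen for illustration.

```python
import math


def cov(t):
    """Hypothetical convex, decreasing point covariance."""
    return math.exp(-3.0 * abs(t))


def phi(m, n, grid=512):
    """Midpoint-rule estimate of Cov[c_0(n), c_m(n)] using `grid` cells over [0, 1].

    `grid` must be divisible by n so that coarse and fine clusters share grid points.
    """
    k = grid // n          # grid cells per cluster
    h = 1.0 / grid         # cell width
    total = 0.0
    for a in range(k):
        s = (a + 0.5) * h
        for b in range(k):
            t = m / n + (b + 0.5) * h
            total += cov(t - s)
    return total * h * h


n = 8
print(phi(0, n // 2), 2 * phi(0, n) + 2 * phi(1, n))                  # equation (7)
print(phi(1, n // 2), 2 * phi(2, n) + phi(3, n) + phi(1, n))          # equation (8), |j - i| = 1
```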

Substituting the right-hand-sides of equations (7) and (8) into equation (5), written for $n/2$ clusters, and working through the
algebra, we find that

$$
\begin{aligned}
\mathrm{Var}[L(n/2)] = {} & (n/P)\,\phi(0, n) + (n/P)\,\phi(1, n) + 2 \sum_{k=1}^{n/(2P)-1} ((n/P) - 2k)\,\phi(2kP, n) \\
& + \sum_{k=1}^{n/(2P)-1} ((n/P) - 2k)\,\bigl( \phi(2kP + 1, n) + \phi(2kP - 1, n) \bigr).
\end{aligned}
$$

Using this expression and (5), we compute the difference $\mathrm{Var}[L(n/2)] - \mathrm{Var}[L(n)]$. All terms involving
$\phi(2kP, n)$ cancel, for $k = 0, \ldots, n/(2P) - 1$. Each remaining term from $\mathrm{Var}[L(n)]$ has the form
$2((n/P) - 2k - 1)\,\phi((2k + 1)P, n)$, for $k = 0, \ldots, n/(2P) - 1$. We split each such term into the sum
$(n/P - 2k)\,\phi((2k + 1)P, n) + (n/P - 2k - 2)\,\phi((2k + 1)P, n)$, and pair these with $\mathrm{Var}[L(n/2)]$ terms as follows:

$$
\begin{aligned}
\mathrm{Var}[L(n/2)] - \mathrm{Var}[L(n)] = \sum_{k=0}^{n/(2P)-1} \Bigl( & (n/P - 2k)\,\bigl( \phi(2kP + 1, n) - \phi((2k + 1)P, n) \bigr) \\
& - (n/P - 2k - 2)\,\bigl( \phi((2k + 1)P, n) - \phi((2k + 2)P - 1, n) \bigr) \Bigr). \qquad (9)
\end{aligned}
$$

One characteristic of a convex function $g$ is that for fixed $y$ the difference $g(x) - g(x + y)$ is a decreasing
function of $x$. Every two terms we have paired differ in their index arguments by exactly $P - 1$, e.g.,
$\phi(2kP + 1, n)$ and $\phi((2k + 1)P, n)$. Since $\phi$ is a convex function of the index argument once the index is at
least 1, we have for every $k$

$$\phi(2kP + 1, n) - \phi((2k + 1)P, n) \ge \phi((2k + 1)P, n) - \phi((2k + 2)P - 1, n).$$

The left-hand-side expression in this inequality is weighted more heavily in equation (9) than is the
right-hand-side expression. It follows that $\mathrm{Var}[L(n/2)] - \mathrm{Var}[L(n)] \ge 0$, proving our first result.

Theorem 1 Suppose the workload process $W(t)$ is second-order stationary with a convex covariance function.
Then increasing the degree of a scatter decomposition does not increase the processor workload variance.

□
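A small numerical experiment illustrates Theorem 1. The sketch below evaluates equation (5) for successive degrees, with cluster covariances obtained from a midpoint-rule approximation of equation (2). The covariance $e^{-3|t|}$ (convex and decreasing), the grid size, and the choice $P = 4$ are assumptions made only for this example.

```python
import math


def cov(t):
    """Hypothetical convex, decreasing point covariance."""
    return math.exp(-3.0 * abs(t))


def phi(m, n, grid=256):
    """Midpoint-rule estimate of Cov[c_0(n), c_m(n)]; grid must be divisible by n."""
    k = grid // n
    h = 1.0 / grid
    # With s = (a + 0.5) h and t = m/n + (b + 0.5) h, the lag t - s is m/n + (b - a) h.
    return h * h * sum(cov(m / n + (b - a) * h) for a in range(k) for b in range(k))


def variance(n, P):
    """Common processor workload variance under modular mapping, equation (5)."""
    q = n // P
    return q * phi(0, n) + 2 * sum((q - k) * phi(k * P, n) for k in range(1, q))


P = 4
variances = [variance(2 ** d, P) for d in range(2, 6)]   # degrees 2 through 5
for d, v in zip(range(2, 6), variances):
    print(d, v)
```

The printed variances shrink as the degree grows, in line with the theorem.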
2.4 Decreasing Expected Maximum Workload


Next we demonstrate circumstances where increasing the degree of a scatter decomposition reduces the
expected workload of the most heavily loaded processor. The argument is to show that under appropriate
assumptions the correlation between any two processors' workloads increases as the degree increases. We
then cite a result from the literature proving that the expected maximum decreases in this situation.

We assume that the workload process $\{W(t)\}$ is a stationary Gaussian process² [7]. Additivity properties
of the Gaussian then ensure that the vector of $n$ clusters has a jointly normal distribution [7] (Chapter 6) and
that under any assignment, the processors' workloads are jointly normal. We also assume that the covariance
function is $\mathrm{Cov}(t) = \sigma^2 \max\{0, 1 - \alpha t\}$, where $\alpha = 2^v/w_0 > 1$ for some integers $v, w_0 > 0$. The restriction
on $\alpha$ is used to simplify certain calculations. $\delta = 1/\alpha$ is the smallest distance $t$ at which $\mathrm{Cov}(t) = 0$. Our
results apply when the degree is large enough so that subinterval $(0, \delta]$ is partitioned into at least $P = 2^p$
clusters. If the degree is $d$, then the number of clusters in $(0, \delta]$ is $\delta 2^d$. Now let $d_0$ be the least $d$ such that
$\delta 2^d \bmod 2^p = 0$. Equivalently, $d_0$ is the least integer $d$ such that $w_0 2^{d-p-v}$ is an integer. Clearly $d_0 \le p + v$.
Our results apply when the degree is at least $d_0$.

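We can compute functional forms for $\phi(|j - i|, n)$ given this explicit definition of $\mathrm{Cov}(t)$. Performing the
integration given by (2) one determines that

$$
\phi(|j - i|, n) =
\begin{cases}
\dfrac{\sigma^2}{n^3}\,(n - \alpha|j - i|) & \text{if } 0 < |j - i| < \delta n \\[1ex]
\dfrac{\sigma^2 \alpha}{6n^3} & \text{if } |j - i| = \delta n \\[1ex]
0 & \text{if } |j - i| > \delta n
\end{cases}
\qquad (10)
$$

These calculations take advantage of the fact that $\delta$ is a multiple of $1/n$. The variance of a cluster is
determined by evaluating (1), yielding

$$\phi(0, n) = \frac{\sigma^2 (n - \alpha/3)}{n^3}. \qquad (11)$$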
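The closed forms (10) and (11) can be compared against a direct numerical evaluation of the double integral (2). The sketch below uses a midpoint rule, which is only approximate near the kinks of the elbow function, so the comparison uses a loose tolerance. The parameter values ($\sigma^2 = 1$, $\alpha = 2$, $n = 8$) and function names are assumptions for illustration.

```python
import math


def elbow_cov(t, sigma2, alpha):
    """Elbow covariance sigma^2 * max(0, 1 - alpha * |t|)."""
    return sigma2 * max(0.0, 1.0 - alpha * abs(t))


def phi_numeric(m, n, sigma2, alpha, k=200):
    """Midpoint-rule estimate of Cov[c_0(n), c_m(n)] with k sample points per cluster side."""
    h = 1.0 / (n * k)
    total = 0.0
    for a in range(k):
        s = (a + 0.5) * h
        for b in range(k):
            t = m / n + (b + 0.5) * h
            total += elbow_cov(t - s, sigma2, alpha)
    return total * h * h


def phi_closed(m, n, sigma2, alpha):
    """Equations (10) and (11); assumes delta * n = n / alpha is an integer."""
    delta_n = n / alpha
    if m == 0:
        return sigma2 * (n - alpha / 3) / n ** 3
    if m < delta_n:
        return sigma2 * (n - alpha * m) / n ** 3
    if m == delta_n:
        return sigma2 * alpha / (6 * n ** 3)
    return 0.0


for m in range(7):
    print(m, phi_numeric(m, 8, 1.0, 2.0), phi_closed(m, 8, 1.0, 2.0))
```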

Given equations (10) and (11) we can compute processor workload variance and covariance under scatter
decomposition. General expressions for these quantities are given by (5) and (6). For large values of $k$, some
terms in those sums vanish, being zero. Our assumption that the scatter decomposition has degree $d_0$ or
larger ensures that terms which vanish are easily characterized,³ and that those clusters whose indices are
exactly $\delta n$ apart are assigned to the same processor. All $\phi(kP, n)$ terms in (5) vanish for $k > \delta n/P$; we have
$\phi(kP, n) = \sigma^2 \alpha/(6n^3)$ for $k = \delta n/P$. We may rewrite the variance as

$$\mathrm{Var}[L(n)] = (n/P)\,\frac{\sigma^2 (n - \alpha/3)}{n^3} + 2 \sum_{k=1}^{\delta n/P - 1} (n/P - k)\,\frac{\sigma^2 (n - \alpha kP)}{n^3} + 2\,(n/P - \delta n/P)\,\frac{\sigma^2 \alpha}{6n^3}. \qquad (12)$$

Calculation of this equality is much simplified with the use of a symbolic mathematics package.

² Note that this assumption is stronger than we have used so far, due both to stationarity rather than second-order stationarity,
and due to the assumption of a specific workload distribution.

³ This is not the case for smaller degrees. A large number of special cases must be constructed and analyzed. This task
seemed to us to be more tedious than is warranted by the anticipated correspondingly stronger result.

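The processor workload covariance is similarly handled. Assume that $i < j$. $k = \delta n/P$ again delineates
where $\phi$ terms vanish: $\phi(kP + j - i, n) = 0$ for all $k \ge \delta n/P$, and $\phi(kP - j + i, n) = 0$ for all $k > \delta n/P$. We
may rewrite (6) as

$$
\begin{aligned}
\mathrm{Cov}[L_i(n), L_j(n)] = {} & (n/P)\,\frac{\sigma^2}{n^3}\,(n - \alpha(j - i)) + \sum_{k=1}^{\delta n/P - 1} ((n/P) - k)\,\frac{\sigma^2 (n - \alpha(kP + j - i))}{n^3} \\
& + \sum_{k=1}^{\delta n/P} ((n/P) - k)\,\frac{\sigma^2 (n - \alpha(kP - j + i))}{n^3}. \qquad (13)
\end{aligned}
$$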

The correlation between $L_i(n)$ and $L_j(n)$ is the ratio $\mathrm{Cov}[L_i(n), L_j(n)]/\mathrm{Var}[L(n)]$. For all $d \ge d_0$ we
obtain the correlation using (13) and (12), and can treat the ratio as a continuous function of $n$. It is
interesting to note that as $n$ increases the correlation approaches unity. This supports our intuition that
partitioning the domain into increasingly finer clusters and mapping them modularly induces correlation
between processor workloads. In fact, the tendency towards unity is monotonic. Taking the derivative with
respect to $n$ we find that the derivative is positive if

$$(6/3 - 2\delta/3)(j - i) + 2\delta/9 - 2/3 > 0.$$

This inequality holds, since $(6/3 - 2\delta/3) > 2/3$ and $j - i \ge 1$. Consequently, for all $n = 2^d \ge 2^{d_0}$ we must have

$$\mathrm{Cov}[L_i(2n), L_j(2n)]/\mathrm{Var}[L(2n)] \ge \mathrm{Cov}[L_i(n), L_j(n)]/\mathrm{Var}[L(n)].$$
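The monotone growth of the inter-processor correlation can be observed with exact rational arithmetic by combining equations (5), (6), (10) and (11). The sketch below assumes $\sigma^2 = 1$, $\alpha = 2$ (so $\delta = 1/2$) and $P = 4$, for which $d_0 = 3$; these values and the function names are illustrative choices.

```python
from fractions import Fraction


def phi(m, n, alpha):
    """Equations (10) and (11) with sigma^2 = 1; assumes n / alpha is an integer."""
    delta_n = Fraction(n) / alpha
    n_cubed = Fraction(n) ** 3
    if m == 0:
        return (n - alpha / 3) / n_cubed
    if m < delta_n:
        return (n - alpha * m) / n_cubed
    if m == delta_n:
        return alpha / (6 * n_cubed)
    return Fraction(0)


def variance(n, P, alpha):
    """Equation (5)."""
    q = n // P
    return q * phi(0, n, alpha) + 2 * sum((q - k) * phi(k * P, n, alpha) for k in range(1, q))


def covariance(i, j, n, P, alpha):
    """Equation (6), for i < j."""
    q = n // P
    ahead = sum((q - k) * phi(k * P + j - i, n, alpha) for k in range(q))
    behind = sum((q - k) * phi(k * P - j + i, n, alpha) for k in range(1, q))
    return ahead + behind


def correlation(i, j, n, P, alpha):
    return covariance(i, j, n, P, alpha) / variance(n, P, alpha)


alpha, P = Fraction(2), 4
rhos = [correlation(0, 1, 2 ** d, P, alpha) for d in range(3, 7)]   # degrees 3 through 6
for d, rho in zip(range(3, 7), rhos):
    print(d, rho, float(rho))
```

The correlations increase with the degree and stay below one, as the derivative argument predicts.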

Next  we  use  this  relationship  to  analyze  the  expected  maximum  processor  workload. 

The following result is based on the Normal Comparison Lemma [8] (p. 81) and is the key to our observations
concerning the expected maximum processor workload.

Theorem 2 (Leadbetter et al.) Let $\xi_0, \ldots, \xi_k$ be standardized jointly normal random variables, and let
$\eta_0, \ldots, \eta_k$ be standardized jointly normal random variables, such that $\mathrm{Cov}(\xi_i, \xi_j) \le \mathrm{Cov}(\eta_i, \eta_j)$ for each $i, j$,
$i \ne j$. Then for every $u$,

\[
\Pr\{\max\{\xi_0, \ldots, \xi_k\} \le u\} \le \Pr\{\max\{\eta_0, \ldots, \eta_k\} \le u\},
\]

and hence

\[
E[\max\{\xi_0, \ldots, \xi_k\}] \ge E[\max\{\eta_0, \ldots, \eta_k\}].
\]


□ 
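The direction of Theorem 2 can be illustrated numerically. The sketch below is not part of the original analysis: it assumes the special case of equicorrelated standardized normals (every pair has the same correlation `rho`), builds them from a shared component plus independent components, and estimates the expected maximum by Monte Carlo. The function name, sample size, and seed are illustrative choices.

```python
import math
import random

def expected_max_equicorrelated(rho, k=4, samples=20000, seed=0):
    """Monte Carlo estimate of E[max] for k standardized normals whose
    pairwise correlation is rho (0 <= rho < 1). Illustrative helper only."""
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    total = 0.0
    for _ in range(samples):
        shared = rng.gauss(0.0, 1.0)
        # X_i = a*shared + b*Z_i has variance 1 and Cov(X_i, X_j) = rho.
        total += max(a * shared + b * rng.gauss(0.0, 1.0) for _ in range(k))
    return total / samples

low = expected_max_equicorrelated(0.2)   # weakly correlated variables
high = expected_max_equicorrelated(0.8)  # strongly correlated variables
print(f"rho=0.2: {low:.3f}  rho=0.8: {high:.3f}")
```

With larger pairwise correlation the estimated expected maximum should be smaller, which is the mechanism the argument relies on when correlation between processor workloads increases with $n$.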

The standardization of a random variable $X$ is the scaled random variable $Z = (X - m)/s$ where $m$
and $s$ are $X$'s mean and standard deviation, respectively. The mean of a standardized random variable is
zero and its variance is one; the covariance between two standardized random variables is the correlation
between their corresponding unstandardized forms. Let $Z_i(n)$ be the standardized workload of processor $P_i$
given a domain of $n$ clusters. $\mathrm{Cov}[Z_i(n), Z_j(n)] = \mathrm{Cov}[L_i(n), L_j(n)]/\mathrm{Var}[L(n)]$, which we have shown to be
increasing in $n$. If $\hat{n} > n$ (equivalently, if one scatter decomposition has higher degree than another), then

\[
E[\max\{Z_0(n), \ldots, Z_{P-1}(n)\}] \ge E[\max\{Z_0(\hat{n}), \ldots, Z_{P-1}(\hat{n})\}]. \tag{14}
\]

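The expected maximum workload is

\[
\begin{aligned}
E[\max\{L_0(n), \ldots, L_{P-1}(n)\}] &= E\Bigl[\max_{0 \le i \le P-1}\bigl\{\mu/P + \mathrm{Var}[L(n)]^{1/2} Z_i(n)\bigr\}\Bigr] \\
&= \mu/P + \mathrm{Var}[L(n)]^{1/2}\, E\Bigl[\max_{0 \le i \le P-1}\{Z_i(n)\}\Bigr].
\end{aligned}
\]

Theorem 1 shows that $\mathrm{Var}[L(n)] \ge \mathrm{Var}[L(2n)]$; this along with inequality (14) proves our second result.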

Theorem 3 Let $\{W(t)\}$ be a stationary Gaussian process, with a covariance function $\mathrm{Cov}(t) = \sigma^2 \max\{0, 1 - \alpha t\}$,
where $\alpha = 2^u/m_0 > 1$ for some positive integers $m_0$, $u$. Let there be $2^p$ processors, and let $d_0$ be the
least integer $d$ such that $m_0 2^{d-p-u}$ is an integer. If $d_2 > d_1 > d_0$, then the expected maximum processor
workload of a scatter decomposition with degree $d_2$ is no greater than that of a scatter decomposition with
degree $d_1$.

□ 


2.5  Minimization  of  Average  Workload  Variance 


Our final result gives conditions where for a given $n$, among all "balanced" assignments (those placing $n/P$
clusters per processor), the modular mapping minimizes the average processor workload variance. To prove
this result we assume that the covariance function decreases linearly across the entire domain: $\mathrm{Cov}(s) =
\sigma^2(1 - \alpha s)$, for some $\alpha$ satisfying $0 < \alpha < 2$. The result is based on a procedure that takes any assignment
and constructs another whose sum of processor workload variances is no larger. The repeated application of
this procedure produces a modular assignment. Consequently, modular assignments minimize the average
processor workload variance.

The arguments to follow specify individual covariance terms. These arguments are clearer using the
$\mathrm{Cov}[c_i(n), c_j(n)]$ notation rather than $\phi(|j - i|, n)$. It is straightforward to determine the form of $\mathrm{Cov}[c_i(n), c_j(n)]$
under the present assumptions:


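\[
\mathrm{Cov}[c_i(n), c_j(n)] =
\begin{cases}
\dfrac{\sigma^2}{n^3}\,(n - \alpha|j - i|) & \text{if } |j - i| > 0, \\[1ex]
\dfrac{\sigma^2}{n^3}\,(n - \alpha/3) & \text{if } |j - i| = 0.
\end{cases}
\tag{15}
\]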
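Equation (15) can be checked numerically. The sketch below rests on assumptions not restated in this section: the domain is taken to be the unit interval, cluster $i$ is the subinterval $[i/n, (i+1)/n]$, and a cluster's workload is the integral of the workload process over it, so that the cluster covariance is the double integral of $\sigma^2(1 - \alpha|s - t|)$ over the two subintervals. The function names and the midpoint-rule resolution `m` are illustrative.

```python
def cluster_cov_closed_form(i, j, n, alpha, sigma2=1.0):
    """Equation (15): covariance between the workloads of clusters i and j."""
    d = abs(i - j)
    inner = n - alpha * d if d > 0 else n - alpha / 3.0
    return sigma2 / n**3 * inner

def cluster_cov_numeric(i, j, n, alpha, sigma2=1.0, m=100):
    """Midpoint-rule double integral of sigma2 * (1 - alpha*|s - t|) over
    cluster i x cluster j (assumed layout: cluster i = [i/n, (i+1)/n])."""
    h = 1.0 / (n * m)
    total = 0.0
    for a in range(m):
        s = i / n + (a + 0.5) * h
        for b in range(m):
            t = j / n + (b + 0.5) * h
            total += sigma2 * (1.0 - alpha * abs(s - t))
    return total * h * h

n, alpha = 8, 1.5
pairs = [(0, 0), (0, 1), (2, 7)]
for i, j in pairs:
    print(i, j, cluster_cov_closed_form(i, j, n, alpha),
          cluster_cov_numeric(i, j, n, alpha))
```

The $\alpha/3$ term in the diagonal case comes from the integral of $|s - t|$ over a square of side $1/n$, which equals $1/(3n^3)$.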
Let $A_1$ be any assignment matrix describing a balanced assignment. Without loss of generality, we
assume that under $A_1$ the processors are numbered so that $P_0$ is assigned $c_0(n)$, $P_1$ is assigned the smallest
indexed $c_i(n)$ that is not assigned to $P_0$, and in general $P_j$ is assigned the smallest indexed cluster that is
not assigned to any of $P_0, P_1, \ldots, P_{j-1}$.

We will say that $c_j(n)$ is in place if it is assigned to processor $P_{j \bmod P}$. Note that all clusters are in place
under a modular assignment. We construct another balanced assignment $A_2$ by finding the smallest indexed
$c_j(n)$ that is not in place, and by putting it in place. Let $c_f$ denote this cluster, let $P_S$ denote the source
processor that has $c_f$ under $A_1$, and let $P_T$ denote the target processor $P_{f \bmod P}$. Let $c_g$ be the smallest
indexed cluster assigned to $P_T$ such that $g > f$. $A_2$ is constructed from $A_1$ by giving $c_f$ to $P_T$, and $c_g$ to $P_S$.
Figure 3 illustrates these definitions. We will prove that the sum of processor variances under $A_2$ bounds
that sum under $A_1$ from below; consequently the average workload variance under $A_2$ is no greater than
that under $A_1$.
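The construction just described can be sketched in code. The following is an illustration under stated assumptions, not the authors' implementation: an assignment is represented as a list `owner` mapping cluster index to processor index (rather than as an assignment matrix), $\sigma^2$ is set to 1, covariances follow (15), and exact rational arithmetic is used so that the comparison of variance sums is not blurred by rounding. All function names are hypothetical.

```python
from fractions import Fraction
import random

def cluster_cov(i, j, n, alpha):
    """Equation (15) with sigma^2 = 1, as an exact Fraction."""
    d = abs(i - j)
    return (n - alpha * d if d else n - alpha / 3) / Fraction(n**3)

def variance_sum(owner, n, alpha):
    """Sum over processors of the processor workload variance, per (16)."""
    total = Fraction(0)
    for a in range(n):
        for b in range(n):
            if owner[a] == owner[b]:
                total += cluster_cov(a, b, n, alpha)
    return total

def canonical_numbering(owner):
    """Renumber processors in order of their smallest-indexed cluster."""
    relabel = {}
    for p in owner:
        relabel.setdefault(p, len(relabel))
    return [relabel[p] for p in owner]

def swap_step(owner, P):
    """Put the smallest-indexed out-of-place cluster c_f in place by swapping
    it with c_g, the smallest-indexed cluster of the target with g > f.
    Returns False once every cluster is in place."""
    n = len(owner)
    for f in range(n):
        target = f % P
        if owner[f] != target:
            g = next(k for k in range(f + 1, n) if owner[k] == target)
            owner[f], owner[g] = owner[g], owner[f]
            return True
    return False

P, n, alpha = 4, 16, Fraction(3, 2)
rng = random.Random(1)
owner = [k % P for k in range(n)]
rng.shuffle(owner)                    # an arbitrary balanced assignment
owner = canonical_numbering(owner)

history = [variance_sum(owner, n, alpha)]
while swap_step(owner, P):
    history.append(variance_sum(owner, n, alpha))

print([float(v) for v in history])
print(owner)
```

Each entry of `history` is the variance sum after one more swap; the sequence should be non-increasing and the final `owner` list should be the modular assignment.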

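Recall that under any assignment matrix $A$ the variance of $P_i$'s workload is given by

\[
\mathrm{Var}[L_i(A, n)] = (A \Sigma A^T)_{ii} = \sum_{c_j(n) \in A(i)} \; \sum_{c_k(n) \in A(i)} \mathrm{Cov}[c_j(n), c_k(n)], \tag{16}
\]

and that

\[
\mathrm{Cov}[L_i(A, n), L_j(A, n)] = \sum_{c_k(n) \in A(i)} \; \sum_{c_m(n) \in A(j)} \mathrm{Cov}[c_k(n), c_m(n)].
\]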

It is clear from (16) that the variance of any processor other than $P_S$ or $P_T$ is unaffected by swapping $c_f$
and $c_g$. To prove the desired result we need only show that the swap does not increase the sum of $P_T$'s and
$P_S$'s variances. The change in processor variances caused by the swap is entirely due to changes in the sum of
cluster covariance (cc) terms in each processor. After swapping $c_f$ and $c_g$, each cluster $c_j(n)$ assigned to $P_S$
loses the cc term $\mathrm{Cov}[c_f(n), c_j(n)]$ and gains the term $\mathrm{Cov}[c_g(n), c_j(n)]$. We let $\Delta_{LS}$ denote the sum of all
such changes among clusters in $P_S$ to the left of $c_f$, and let $L_S$ denote the number of such clusters. Similarly
$\Delta_{RS}$ denotes the sum of changes among clusters in $P_S$ to the right of $c_g$ and $R_S$ denotes the number of such
clusters; $\Delta_{MS}$ denotes the sum of changes among clusters in $P_S$ with indices between $f$ and $g$. Expressions
for these quantities are derived using equation (15):

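\[
\Delta_{LS} = \sum_{\substack{c_j(n) \in A_1(S) \\ j < f}} \bigl(\mathrm{Cov}[c_j(n), c_g(n)] - \mathrm{Cov}[c_j(n), c_f(n)]\bigr) = -\frac{\sigma^2}{n^3}(g - f) L_S \alpha;
\]

\[
\Delta_{RS} = \sum_{\substack{c_j(n) \in A_1(S) \\ j > g}} \bigl(\mathrm{Cov}[c_g(n), c_j(n)] - \mathrm{Cov}[c_f(n), c_j(n)]\bigr) = \frac{\sigma^2}{n^3}(g - f) R_S \alpha;
\]

\[
\Delta_{MS} = \sum_{\substack{c_k(n) \in A_1(S) \\ f < k < g}} \bigl(\mathrm{Cov}[c_g(n), c_k(n)] - \mathrm{Cov}[c_f(n), c_k(n)]\bigr) = \frac{\sigma^2}{n^3} \sum_{\substack{c_k(n) \in A_1(S) \\ f < k < g}} (2k - f - g)\alpha.
\]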
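The change in $P_S$'s variance after the swap is the sum $\Delta_{LS} + \Delta_{RS} + \Delta_{MS}$.

We can similarly describe the change in $P_T$'s variance with the definitions

\[
\Delta_{LT} = \sum_{\substack{c_j(n) \in A_1(T) \\ j < f}} \bigl(\mathrm{Cov}[c_f(n), c_j(n)] - \mathrm{Cov}[c_g(n), c_j(n)]\bigr) = \frac{\sigma^2}{n^3}(g - f) L_T \alpha;
\]

\[
\Delta_{RT} = \sum_{\substack{c_j(n) \in A_1(T) \\ j > g}} \bigl(\mathrm{Cov}[c_f(n), c_j(n)] - \mathrm{Cov}[c_g(n), c_j(n)]\bigr) = -\frac{\sigma^2}{n^3}(g - f) R_T \alpha.
\]

No term analogous to $\Delta_{MS}$ is necessary since there are no clusters in $P_T$ with indices between $f$ and $g$.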

The change in the sum of $P_S$'s variance with $P_T$'s variance is given by the sum of all the $\Delta$ terms
defined above. We will show that the sum of $\Delta$ terms is bounded from above by 0. At this point a number of
observations are useful. Since all $c_j(n)$ with $j < f$ are in order, it follows that $L_T \le L_S$. Thus $\Delta_{LS} + \Delta_{LT} \le 0$.
It remains to show that $\Delta_{RS} + \Delta_{RT} + \Delta_{MS} \le 0$. We know that

\[
\Delta_{RS} + \Delta_{RT} = -\frac{\sigma^2}{n^3}(R_T - R_S)(g - f)\alpha; \tag{17}
\]

furthermore, since $n/P = L_T + R_T + 1$ we must also have $R_S \le R_T$. We proceed to show that the magnitude
of $\Delta_{MS}$ is no greater than the magnitude of (17) and consequently prove the larger result.

$m = n/P - L_S - R_S - 1$ is the number of clusters in $P_S$ whose indices lie strictly between $f$ and $g$. $\Delta_{MS}$
is maximised when the indices of these clusters are as large as possible; when $k = g-1, g-2, \ldots, g-m$.
With such indices, the sum of $c_g$'s cc terms in $P_S$ is

\[
\frac{\sigma^2}{n^3} \sum_{i=1}^{m} (n - i \cdot \alpha).
\]

Likewise, the sum of $c_f$'s cc terms in $P_S$ is

\[
\frac{\sigma^2}{n^3} \sum_{i=1}^{m} \bigl(n - (g - f - i)\alpha\bigr).
\]


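From this, we see that $\Delta_{MS}$ when maximised can be written as

\[
\Delta_{MS} = \frac{\sigma^2}{n^3} \sum_{i=1}^{m} (n - i \cdot \alpha) - \frac{\sigma^2}{n^3} \sum_{i=1}^{m} \bigl(n - (g - f - i)\alpha\bigr) \le \frac{\sigma^2}{n^3}\, m (g - f)\alpha.
\]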
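The bound on the maximised $\Delta_{MS}$ can be worked out term by term. The following is a short check, assuming the common prefactor $\sigma^2/n^3$ from (15) and the index choice $k = g - i$ for $i = 1, \ldots, m$:

```latex
\[
\sum_{i=1}^{m} \Bigl[(n - i\alpha) - \bigl(n - (g - f - i)\alpha\bigr)\Bigr]
  = \alpha \sum_{i=1}^{m} (g - f - 2i)
  = \alpha \bigl(m(g - f) - m(m + 1)\bigr)
  \le m (g - f)\alpha .
\]
```

The discarded term $m(m+1)\alpha$ is non-negative, so $m(g - f)\alpha$ (times the prefactor) is an upper bound rather than an exact value.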


But note that

\[
\begin{aligned}
m &= n/P - L_S - R_S - 1 \\
  &\le n/P - L_T - R_S - 1 \\
  &= (n/P - L_T - R_T - 1) + (R_T - R_S) \\
  &= R_T - R_S,
\end{aligned}
\]

so that

\[
\Delta_{RS} + \Delta_{RT} + \Delta_{MS} \le \frac{\sigma^2}{n^3}\bigl(-(R_T - R_S)(g - f)\alpha + m \cdot (g - f)\alpha\bigr) \le 0.
\]

Consequently, swapping $c_f$ and $c_g$ does not increase the sum of $P_S$'s and $P_T$'s variances. Furthermore, the
swap does not affect the sum of other processors' variances. Repeatedly applying this procedure puts every
cluster in place, which is the modular assignment. This discussion has proved the following theorem.

Theorem 4 Let $\{W(t)\}$ be a second-order stationary process, with a covariance function $\mathrm{Cov}(s) = \sigma^2(1 -
\alpha s)$, where $0 < \alpha < 2$. Let $P$ and $n$ be given such that $P$ divides $n$ evenly, and let $A_M$ be the $P \times n$
assignment matrix describing the modular mapping. Then for any $P \times n$ assignment matrix $A$ describing a
balanced assignment,

\[
(1/P) \sum_{i=0}^{P-1} \mathrm{Var}[L_i(A_M, n)] \le (1/P) \sum_{i=0}^{P-1} \mathrm{Var}[L_i(A, n)].
\]


In the event that the workload process is Gaussian and stationary, we can show that increasing the degree
reduces the expected maximum processor workload. We determine the processor variance and covariance
under scatter decomposition by substituting the values given by (15) into (5) and (6). Assume that $i < j$.
Working through the algebra one determines that

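\[
\mathrm{Var}[L_i(n)] = \sigma^2 \left( \frac{1 - \alpha/3}{P^2} + \frac{\alpha(1 - 1/P)}{3n^2} \right),
\]

and that

\[
\mathrm{Cov}[L_i(n), L_j(n)] = \sigma^2 \left( \frac{1 - \alpha/3}{P^2} + \frac{\alpha\bigl(1/3 - (j - i)/P\bigr)}{n^2} \right).
\]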

The derivative with respect to $n$ of $\mathrm{Cov}[L_i(n), L_j(n)]/\mathrm{Var}[L(n)]$ is positive if

\[
(6/3 - 2\alpha/3)(j - i) + 2\alpha/9 - 2/3 > 0.
\]

This is always true over the range $\alpha \in (0, 2)$. Consequently the same arguments used to prove Theorem 3
can be applied here.
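The algebra behind these statements can be checked by brute force. The sketch below is an independent check rather than the authors' calculation: it sums the cluster covariances of (15) directly under the modular mapping (processor $i$ holds clusters $i, i+P, i+2P, \ldots$), with $\sigma^2 = 1$ and exact rational arithmetic, and compares the result with closed forms obtained by carrying out that substitution. The closed forms in `var_closed` and `cov_closed` are this sketch's own derivation and should be read as assumptions; all names are illustrative.

```python
from fractions import Fraction

def cluster_cov(a, b, n, alpha):
    """Equation (15) with sigma^2 = 1, as an exact Fraction."""
    d = abs(a - b)
    return (n - alpha * d if d else n - alpha / 3) / Fraction(n**3)

def modular_cov(i, j, n, P, alpha):
    """Cov[L_i(n), L_j(n)] under modular mapping, by direct summation.
    With i == j this is the processor workload variance."""
    return sum(cluster_cov(a, b, n, alpha)
               for a in range(i, n, P) for b in range(j, n, P))

def var_closed(n, P, alpha):
    """Closed form for Var[L_i(n)] (derived for this sketch)."""
    return (1 - alpha / 3) / P**2 + alpha * (1 - Fraction(1, P)) / (3 * n**2)

def cov_closed(i, j, n, P, alpha):
    """Closed form for Cov[L_i(n), L_j(n)], i < j (derived for this sketch)."""
    return ((1 - alpha / 3) / P**2
            + alpha * (Fraction(1, 3) - Fraction(j - i, P)) / n**2)

P, alpha = 4, Fraction(3, 2)
correlations = []
for n in (8, 16, 32, 64):
    var = modular_cov(0, 0, n, P, alpha)
    cov = modular_cov(1, 3, n, P, alpha)
    assert var == var_closed(n, P, alpha)
    assert cov == cov_closed(1, 3, n, P, alpha)
    correlations.append(cov / var)

print([float(c) for c in correlations])
```

The printed correlations should increase with $n$, in line with the positive derivative stated above.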

3  Summary 

Scatter  decomposition  is  an  attractive  method  for  mapping  domain-oriented  computations  with  irregular 
workloads  to  parallel  architectures.  Scatter  decomposition  partitions  the  domain  into  n  equal-size  pieces, 
and  maps  them  modularly  onto  P  processors.  This  paper  uses  a  formal  probabilistic  model  of  correlated 
workload  in  a  one-dimensional  domain  to  explain  why  and  when  scatter  decomposition  works.  First,  we 
show  that  periodicity  in  workload  correlation  can  lead  to  load  imbalance  under  scatter  decomposition  if  the 
correlation period aligns with the period of the modular mapping. Consequently we consider nonperiodic
workload correlation functions.

Our  first  result  shows  that  if  workload  correlation  is  a  convex  function  of  distance,  then  scattering  with 
increasingly  finer  grained  clusters  decreases  a  processor's  workload  variance,  thereby  increasing  the  average 
inter-processor  workload  correlation.  Since  the  processor  workload  mean  is  unaffected  by  this  change,  one 
anticipates  that  the  expected  maximum  workload  will  correspondingly  decrease. 

Our  second  result  affirms  this  intuition  under  a  stronger  set  of  assumptions:  the  workload  process  is 
Gaussian,  and  the  correlation  function  decreases  linearly  in  distance  until  it  reaches  zero  and  then  stays  at 
zero.  We  then  show  that  once  a  scatter  decomposition  is  sufficiently  fine-grained,  making  the  grain-size  finer 
reduces  the  expected  maximum  processor  workload. 

Our  third  result  shows  that  under  slightly  different  assumptions  still,  among  all  possible  “balanced” 
mappings  scatter  decomposition  minimizes  the  average  processor  workload  variance.  This  result  depends  on 
the correlation function decreasing linearly across the entire domain. In this case it is also true that if the
workload  process  is  Gaussian,  then  scattering  a  finer-grained  decomposition  reduces  the  expected  maximum 
processor  workload. 

These  analytic  results  serve  to  formally  verify  the  intuition  behind  scatter  decomposition.  However, 
the  results  only  concern  load  balance.  The  additional  communication  cost  of  decreasing  granularity  is 
not  built  into  this  model.  Extensions  to  this  work  might  find  the  optimal  granularity  by  determining 
a  quantitative  estimator  of  the  expected  maximum  workload  and  the  expected  communication  cost  as  a 
function of granularity. An overall execution time model would be constructed depending on the influence
of  architecture  on  the  communication  costs,  and  then  optimized. 